General

Linear Regression Using Clickhouse Machine Learning Functions

Benjamin Wootton

2024-07-11

8 min read

Featured image for Linear Regression Using Clickhouse Machine Learning Functions

This article is part of a series where I look at doing data science work within ClickHouse. Articles in this series include forecasting, anomaly detection, linear regression and time series classification.

Though this type of analysis would more typically take place outside of ClickHouse in a programming language such as Python or R, my preference is to take things as far as possible using just the database.

By doing this, I can rely on the power of ClickHouse to process large datasets with high performance, and reduce or even totally avoid the amount of code that I need to write. This also means that I can work with smaller in-memory datasets on the client side and potentially avoid the need for distributed computation using frameworks such as Spark.

About This Example

In this article, I will carry out a simple linear regression analysis, which I will use to predict delivery times based on two variables - the distance of the delivery and the hour the package was picked up for delivery.

I will work with and render geographical data as part of the analysis, for instance making use of Clickhouse's geoDistance function to calculate distances based on geographical coordinates.

Dataset

My dataset is a small extract of this last-mile delivery dataset by Hugging Face.

Though the dataset is large and detailed, I will look at a subset of 2,293 orders delivered by a single courier, number 75, in region 53 of the Chinese city of Jilin in order to make it easier to follow the example.

A preview of the data is shown below. I only use the columns with the times and locations of the courier's pickups and deliveries, in addition to the order ids.

SELECT *
FROM deliveries
LIMIT 5

Query id: f2fb9292-3b14-4bc1-b9dd-780e941e1fee

┌─order_id─┬─────accept_gps_time─┬─accept_gps_lat─┬─accept_gps_lng─┬───delivery_gps_time─┬─delivery_gps_lat─┬─delivery_gps_lng─┐
│     7350 │ 2022-07-15 08:45:00 │       43.81204 │       126.5669 │ 2022-07-15 13:38:00 │         43.83002 │         126.5517 │
│     7540 │ 2022-07-21 08:27:00 │       43.81219 │      126.56692 │ 2022-07-21 14:27:00 │         43.82541 │        126.55379 │
│     7660 │ 2022-08-30 08:30:00 │       43.81199 │      126.56993 │ 2022-08-30 13:52:00 │         43.82757 │        126.55321 │
│     8542 │ 2022-08-19 09:09:00 │       43.81219 │      126.56689 │ 2022-08-19 15:59:00 │         43.83033 │        126.55078 │
│    12350 │ 2022-08-05 08:52:00 │       43.81215 │      126.56693 │ 2022-08-05 09:10:00 │         43.81307 │        126.56889 │
└──────────┴─────────────────────┴────────────────┴────────────────┴─────────────────────┴──────────────────┴──────────────────┘

5 rows in set. Elapsed: 0.030 sec. Processed 2.29 thousand rows, 64.18 KB (75.64 thousand rows/s., 2.12 MB/s.)
Peak memory usage: 723.95 KiB.

Using my Hex Notebook, I can easily render a heatmap of the delivery locations around Jilin, observing that more deliveries occur in central areas:

Hex notebook screenshot: Linear Regression (1)

My model will also take account of the pickup time as a second variable. Therefore, I will also visualise the distribution of the number of orders by pickup hour and can observe that most packages are collected at 8am in the morning.

Hex notebook screenshot: Linear Regression (2)

Data Preparation

My model will predict the time elapsed between the pickup and delivery (in minutes) as a function of the distance between the pickup and the delivery locations (in meters) and of the pickup hour.

I use Clickhouse geoDistance function for calculating the distance between the pickup and the delivery locations given their coordinates (latitude and longitude), while I use Clickhouse date_diff function for calculating the time elapsed between pickup and delivery.

I also add to the dataset a randomly generated training index using randUniform function, which is equal to 1 for 80% of the data, which will be used for training, and equal to 0 for the remaining 20% of the data, which will be used for testing performance of the model.

create or replace table 
    deliveries_dataset (
        order_id UInt32,
        delivery_time Float64,
        delivery_distance Float64,
        Hour7 Float64,
        Hour8 Float64,
        Hour9 Float64,
        Hour10 Float64,
        Hour11 Float64,
        Hour12 Float64,
        Hour13 Float64,
        Hour14 Float64,
        Hour15 Float64,
        Hour16 Float64,
        training Float64
    )
engine = MergeTree
order by order_id

insert into
    deliveries_dataset
select 
    order_id,
    date_diff('minute', accept_gps_time, delivery_gps_time) as delivery_time,
    geoDistance(accept_gps_lng, accept_gps_lat, delivery_gps_lng, delivery_gps_lat) as delivery_distance,
    if(toHour(accept_gps_time) = 7, 1, 0) as Hour7,
    if(toHour(accept_gps_time) = 8, 1, 0) as Hour8,
    if(toHour(accept_gps_time) = 9, 1, 0) as Hour9,
    if(toHour(accept_gps_time) = 10, 1, 0) as Hour10,
    if(toHour(accept_gps_time) = 11, 1, 0) as Hour11,
    if(toHour(accept_gps_time) = 12, 1, 0) as Hour12,
    if(toHour(accept_gps_time) = 13, 1, 0) as Hour13,
    if(toHour(accept_gps_time) = 14, 1, 0) as Hour14,
    if(toHour(accept_gps_time) = 15, 1, 0) as Hour15,
    if(toHour(accept_gps_time) = 16, 1, 0) as Hour16,
    if(randUniform(0, 1) <= 0.8, 1, 0) as training
from 
    deliveries

When visualised, delivery distance and delivery time are positively correlated with greater variance as journeys get longer. This is intuitively as I would expect as longer journeys become harder to predict.

Hex notebook screenshot: Linear Regression (3)

Model Training

I use Clickhouse's stochasticLinearRegression function for fitting the linear regression model based on the 80% of my dataset which contains training data.

Given that this function uses gradient descent, I scale the delivery distance (which is the only continuous feature) by subtracting the training set mean and dividing by the training set standard deviation. I take the logarithm of the target to make sure that the time to delivery predicted by the model is never negative.

create view 
    deliveries_model 
as with
    (select avg(delivery_distance) from deliveries_dataset where training = 1) as loc,
    (select stddevSamp(delivery_distance) from deliveries_dataset where training = 1) as scale
select
    stochasticLinearRegressionState(0.1, 0.0001, 15, 'SGD')(
        log(delivery_time), 
        assumeNotNull((delivery_distance - loc) / scale),
        Hour7,
        Hour8,
        Hour9,
        Hour10,
        Hour11,
        Hour12,
        Hour13,
        Hour14,
        Hour15,
        Hour16
    ) 
as 
    state
from 
    deliveries_dataset 
where training = 1

Model Evaluation

I can now use the fitted model to make predictions for the remaining 20% of my dataset. I will do this by comparing the predicted delivery times with the actuals to calculate the accuracy of my model.

create view 
    deliveries_results
as with
    (select avg(delivery_distance) from deliveries_dataset where training = 1) as loc,
    (select stddevSamp(delivery_distance) from deliveries_dataset where training = 1) as scale,
    (select state from deliveries_model) as model
select
    toInt32(delivery_time) as ACTUAL,
    toInt32(exp(evalMLMethod(
        model, 
        assumeNotNull((delivery_distance - loc) / scale),
        Hour7,
        Hour8,
        Hour9,
        Hour10,
        Hour11,
        Hour12,
        Hour13,
        Hour14,
        Hour15,
        Hour16
    ))) as PREDICTED
from
    deliveries_dataset 
where training = 0

I now have a table of ACTUAL delivery times and PREDICTED delivery times for the 20% test portion of my dataset.

SELECT *
FROM deliveries_results
LIMIT 10

Query id: 2a4cee84-bf43-4925-947e-c9f0a149c05f

┌─ACTUAL─┬─PREDICTED─┐
│    410 │       370 │
│    101 │       122 │
│    361 │       214 │
│    189 │        69 │
│    122 │        92 │
│    454 │       365 │
│    155 │       354 │
│    323 │       334 │
│    145 │       153 │
│     17 │        20 │
└────────┴───────────┘

10 rows in set. Elapsed: 0.015 sec. Processed 9.17 thousand rows, 267.76 KB (619.10 thousand rows/s., 18.07 MB/s.)
Peak memory usage: 2.28 MiB.

I can also plot these visually as per below in my notebook:

Hex notebook screenshot: Linear Regression (4)

To explain the plot, if the model was performing perfectly, then I would expect PREDICTED and ACTUAL to match in every case, meaning that all points would line up on the orange curve. In reality, my model did have errors which I will now analyse.

Model Performance

Looking at the visualisation above, I can see that my model performed relatively well for shorter journeys less than 120 minutes, but predictive accuracy begins to fall away for longer distance journeys as they become more complex and harder to predict.

This would be in line with my real-world experience whereby the longer and more arduous a journey is, the harder it is to predict.

More scientifically, I can evaluate the models performance by looking at the model's mean absolute error (MAE) and root mean squared error (RMSE). This gives me a value of approximately 1 hour across the entire dataset:

SELECT
    avg(abs(ACTUAL - PREDICTED)) AS MAE,
    sqrt(avg(pow(ACTUAL - PREDICTED, 2))) AS RMSE
FROM deliveries_results

Query id: 0082d2a2-4735-4cf6-8d0e-dba4e010a918

┌───────────────MAE─┬──────────────RMSE─┐
│ 58.18494623655914 │ 78.10208373578114 │
└───────────────────┴───────────────────┘

1 row in set. Elapsed: 0.022 sec. Processed 9.17 thousand rows, 267.76 KB (407.90 thousand rows/s., 11.91 MB/s.)
Peak memory usage: 2.28 MiB.

If I limit this to just the shorter journeys with an ACTUAL of less than 2 hours (120 minutes), then I can see that my model performs better with an MAE and RMSE closer to 30 minutes:

SELECT
    avg(abs(ACTUAL - PREDICTED)) AS MAE,
    sqrt(avg(pow(ACTUAL - PREDICTED, 2))) AS RMSE
FROM deliveries_results
WHERE ACTUAL < 120

Query id: edb8eef0-24a7-4579-a856-4e3f3e10cd20

┌────────────────MAE─┬──────────────RMSE─┐
│ 29.681159420289855 │ 41.68671981213744 │
└────────────────────┴───────────────────┘

1 row in set. Elapsed: 0.014 sec. Processed 9.17 thousand rows, 267.76 KB (654.46 thousand rows/s., 19.11 MB/s.)
Peak memory usage: 2.35 MiB.

Conclusion

In this article I have demonstrated how I can use a simple linear regression function to predict output values based on 2 input variables.

The performance of the model was resasonable at shorter distances, but began to break down as the output variable became harder to predict. That said, I can see that a simple linear regression conducted entirely within ClickHouse and using only 2 variables does have some predictive capability and may perform better in other datasets and domains.

Free Whitepaper

Why ClickHouse Should Be Your Next Database

How ClickHouse is differentiated from other analytical databases.

Download

Written by

Benjamin Wootton

Freelance Consultant - ClickHouse

Benjamin Wootton is an independent ClickHouse consultant. I help businesses deploy ClickHouse open source and ClickHouse Cloud, build solutions on top of ClickHouse for real-time analytics, observability and AI, and resolve performance and reliability issues with their existing deployments. Visit my home page to learn more.

Connect on LinkedIn

END OF FILE Share

Related Insights

GeneralPerformance

Introducing Query Dog, A Tool For Working With The ClickHouse Query Log

4 min read

GeneralPerformance

The Risk Of Misusing ClickHouse Partitions

8 min read

General

Comparing The TCO Of ClickHouse Cloud vs Open Source

8 min read